AITopics | data selection technique

TextGram: Towards a better domain-adaptive pretraining

Hiwarkhedkar, Sharayu, Mittal, Saloni, Magdum, Vidula, Dhekane, Omkar, Joshi, Raviraj, Kale, Geetanjali, Ladkat, Arnav

arXiv.org Artificial Intelligence, Apr-28-2024

For green AI, it is crucial to measure and reduce the carbon footprint of training large language models. In NLP, pre-training Transformer models requires significant computational resources. Pre-training uses a large amount of text data to acquire prior knowledge for downstream tasks. It is therefore important to select domain-specific data from this vast corpus to achieve the best results on domain-specific tasks. Although training on large amounts of unlabeled data is expensive, the cost can be reduced by performing a data selection step before pre-training. Selecting important data reduces the storage overhead and the substantial time required to pre-train the model while maintaining accuracy. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, which effectively selects essential data from large corpora. We compare and evaluate fine-tuned models on a text classification task with and without data selection. We show that the proposed strategy works better than other selection methods.
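The abstract does not spell out how TextGram scores documents, so the following is only a generic illustration of a data selection step before pre-training, not the authors' method. It assumes a simple word n-gram overlap score against a small in-domain sample; the function names, the `keep_fraction` parameter, and the toy texts are all hypothetical.

```python
def ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(text, domain_ngrams, n=2):
    """Fraction of the text's n-grams that also occur in the in-domain sample."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return len(grams & domain_ngrams) / len(grams)


def select_in_domain(corpus, domain_sample, keep_fraction=0.5, n=2):
    """Keep the top `keep_fraction` of corpus documents ranked by n-gram overlap."""
    # Build the in-domain n-gram vocabulary once (assumed scoring scheme).
    domain_ngrams = set()
    for doc in domain_sample:
        domain_ngrams |= ngrams(doc, n)
    ranked = sorted(
        corpus,
        key=lambda doc: overlap_score(doc, domain_ngrams, n),
        reverse=True,
    )
    keep = max(1, int(len(corpus) * keep_fraction))
    return ranked[:keep]


# Toy data, invented for illustration only.
corpus = [
    "the patient was given a low dose of aspirin",
    "the team won the match in extra time",
    "a low dose of the drug reduced the fever",
    "stock prices fell sharply on monday",
]
domain_sample = ["a low dose of aspirin reduced the fever in the patient"]

selected = select_in_domain(corpus, domain_sample, keep_fraction=0.5)
```

With these toy inputs the two medical sentences are kept and the sports and finance sentences are dropped, so the pre-training corpus shrinks by half while staying close to the target domain.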

corpus, dataset, selection


doi: 10.1007/978-3-031-58495-4_12

arXiv: 2404.18228

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > Massachusetts (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)
Asia > India > Maharashtra > Pune (0.04)

Genre: Research Report (0.40)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)


Improving Cross-Lingual Transfer Learning by Filtering Training Data: Alexa Blogs

#artificialintelligence, Nov-1-2019, 18:46:59 GMT

Cross-lingual transfer learning, in which a model trained in one language is retrained in another, can make it easier to bootstrap a model in a language for which training data is scarce by taking advantage of more abundant data in a source language. But sometimes the data in the source language is so abundant that using all of it to train a transfer model would be impractically time consuming. Moreover, linguistic differences between source and target languages mean that pruning the source-language training data, so that its statistical patterns better match those of the target language, can actually improve the performance of the transferred model. In a paper we're presenting at this year's Conference on Empirical Methods in Natural Language Processing, we describe experiments with a new data selection technique that let us halve the amount of training data required in the source language while actually improving a transfer model's performance in a target language. For evaluation purposes, we used two techniques to cut the source-language data set in half: one was our data selection technique, and the other was random sampling.
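The evaluation setup described above, halving the source-language data either by selection or by random sampling, can be sketched roughly as follows. This is a minimal illustration, not the scoring method from the paper: `score` is a hypothetical placeholder for any function that rates how well a source-language example matches the target language, and the utterances and scores are invented.

```python
import random


def halve_random(examples, seed=0):
    """Baseline: keep a uniformly random half of the source-language examples."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return rng.sample(examples, len(examples) // 2)


def halve_by_score(examples, score):
    """Keep the half of the examples that `score` rates highest."""
    ranked = sorted(examples, key=score, reverse=True)
    return ranked[: len(examples) // 2]


# Toy source-language utterances with made-up match scores (assumptions).
source = [
    "play some jazz",
    "what's the weather",
    "set a timer for ten minutes",
    "turn off the lights",
]
toy_scores = {
    "play some jazz": 0.9,
    "what's the weather": 0.2,
    "set a timer for ten minutes": 0.7,
    "turn off the lights": 0.4,
}

selected_half = halve_by_score(source, toy_scores.get)
random_half = halve_random(source)
```

Training one transfer model on `selected_half` and another on `random_half`, then comparing them on target-language data, separates the effect of the selection criterion from the effect of simply using less data.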

data selection technique, target language, transfer model


Industry: Retail > Online (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)


Amazon researchers reduce data required for AI transfer learning

#artificialintelligence, Oct-28-2019, 16:37:01 GMT

Cross-lingual learning is an AI technique that involves training a natural language processing model in one language and retraining it in another. It's been demonstrated that retrained models can outperform those trained from scratch in the second language, which is likely why researchers at Amazon's Alexa division are investing considerable time in studying them. In a paper scheduled to be presented at this year's Conference on Empirical Methods in Natural Language Processing, two scientists in the Alexa AI Natural Understanding group, Quynh Do and Judith Gaspers, and colleagues propose a data selection technique that halves the amount of required training data. They claim that, surprisingly, it improves rather than compromises the model's overall performance in the target language. "Sometimes the data in the source language is so abundant that using all of it to train a transfer model would be impractically time consuming," wrote Do and Gaspers in a blog post.

target language, transfer model, utterance


Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.40)
